Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits

Authors

  • Miroslav Dudík
  • Dumitru Erhan
  • John Langford
  • Lihong Li
Abstract

We present and prove properties of a new offline policy evaluator for an exploration learning setting which is superior to previous evaluators. In particular, it simultaneously and correctly incorporates techniques from importance weighting, doubly robust evaluation, and nonstationary policy evaluation approaches. In addition, our approach allows generating longer histories by careful control of a bias-variance tradeoff, and further decreases variance by incorporating information about randomness of the target policy. Empirical evidence from synthetic and real-world exploration learning problems shows that the new evaluator successfully unifies previous approaches and uses information an order of magnitude more efficiently.
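For readers unfamiliar with two of the ingredients named in the abstract, the following is a minimal sketch of the standard importance-weighted (inverse propensity scoring) and doubly robust value estimates for a stationary target policy. It is not the evaluator proposed in the paper, and it does not cover the nonstationary case. All names (`LoggedEvent`, `ips_value`, `dr_value`, `EXAMPLE_LOG`, `match_context_policy`, `constant_reward_model`) and the toy data are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (context, action taken, observed reward, probability the logging policy gave that action)
LoggedEvent = Tuple[int, int, float, float]
# A target policy maps a context to a distribution over actions.
Policy = Callable[[int], Dict[int, float]]
# A reward model maps (context, action) to an estimated reward.
RewardModel = Callable[[int, int], float]


def ips_value(log: List[LoggedEvent], target: Policy) -> float:
    """Importance-weighted estimate of the target policy's average reward."""
    total = 0.0
    for context, action, reward, propensity in log:
        weight = target(context).get(action, 0.0) / propensity
        total += weight * reward
    return total / len(log)


def dr_value(log: List[LoggedEvent], target: Policy, model: RewardModel) -> float:
    """Doubly robust estimate: model prediction plus an importance-weighted correction."""
    total = 0.0
    for context, action, reward, propensity in log:
        action_probs = target(context)
        # Expected model reward under the target policy's action distribution.
        baseline = sum(p * model(context, a) for a, p in action_probs.items())
        weight = action_probs.get(action, 0.0) / propensity
        total += baseline + weight * (reward - model(context, action))
    return total / len(log)


# Toy log: two contexts, two actions, uniform logging policy (propensity 0.5).
# The reward is 1.0 when the action equals the context and 0.0 otherwise.
EXAMPLE_LOG: List[LoggedEvent] = [
    (0, 0, 1.0, 0.5),
    (0, 1, 0.0, 0.5),
    (1, 0, 0.0, 0.5),
    (1, 1, 1.0, 0.5),
]


def match_context_policy(context: int) -> Dict[int, float]:
    """Deterministic target policy: always choose the action equal to the context."""
    return {context: 1.0}


def constant_reward_model(context: int, action: int) -> float:
    """Deliberately crude reward model that predicts 0.5 everywhere."""
    return 0.5


if __name__ == "__main__":
    print(ips_value(EXAMPLE_LOG, match_context_policy))
    print(dr_value(EXAMPLE_LOG, match_context_policy, constant_reward_model))
```

On this toy log both estimates agree, because the logged propensities are exact. The doubly robust form differs from plain importance weighting mainly in variance: the better the reward model, the smaller the correction term that the importance weights multiply.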

Similar resources

A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits

We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use any given Randomized Control Trial (RCT) to generate a range of observational studies (with synthesized “outcome functions”) that can match the user’s specified degrees of sample selection bias, which can then be used to comprehensively assess a given...

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

We study the off-policy evaluation problem, that is, estimating the value of a target policy using data collected by another policy, under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) an...

Resourceful Contextual Bandits

We study contextual bandits with ancillary constraints on resources, which are common in real-world applications such as choosing ads or dynamic pricing of items. We design the first algorithm for solving these problems that improves over a trivial reduction to the non-contextual case. We consider very general settings for both contextual bandits (arbitrary policy sets, Dudik et al. (2011)) and ...

Open Problem: First-Order Regret Bounds for Contextual Bandits

We describe two open problems related to first order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of Õ(√(L* K ln N)), where there are K actions, N policies, and L* is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret Õ(L* poly(K, ln(N/δ))). We describe some positive results, such as an ineff...

On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, whi...

Publication date: 2012